[OpenVINO] Export DFlash for OpenVINO by ofirzaf · Pull Request #1756 · huggingface/optimum-intel

ofirzaf · 2026-05-30T19:21:02Z

What does this PR do?

This PR adds support for exporting DFlash draft models for speculative decoding with OpenVINO.
It also adds hidden_states annotations to exported OV models, to better support methods that need hidden_states as outputs from OV models (such as DFlash and Eagle3). The annotations are applied automatically to all models exported for text generation. The graph itself does not change, since these are only annotations.

Command to export a DFlash model with this PR:

optimum-cli export openvino \
  --model z-lab/Qwen3.6-Coder-35B-A3B-DFlash \
  --task text-generation-with-past \
  --trust-remote-code \
  --dflash-target-model Qwen/Qwen3.6-35B-A3B \
  --disable-convert-tokenizer \
  qwen3.6-35b-a3b-dflash-int8-ov
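
For scripted or batch exports, the same command can be assembled from Python. This is a minimal sketch that only builds the argument list from the flags shown above. The helper name `build_dflash_export_command` is hypothetical and not part of optimum-intel.

```python
import shlex


def build_dflash_export_command(draft_model, target_model, output_dir):
    """Return the optimum-cli argument list for a DFlash draft export.

    Hypothetical helper: it only mirrors the flags from the command above.
    """
    return [
        "optimum-cli", "export", "openvino",
        "--model", draft_model,
        "--task", "text-generation-with-past",
        "--trust-remote-code",
        "--dflash-target-model", target_model,
        "--disable-convert-tokenizer",
        output_dir,
    ]


cmd = build_dflash_export_command(
    draft_model="z-lab/Qwen3.6-Coder-35B-A3B-DFlash",
    target_model="Qwen/Qwen3.6-35B-A3B",
    output_dir="qwen3.6-35b-a3b-dflash-int8-ov",
)

# Print a shell-quoted form for logging or copy-paste.
print(shlex.join(cmd))
```

To actually run the export, the list can be passed to `subprocess.run(cmd, check=True)` in an environment where this PR's branch of optimum-intel is installed.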

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

- Introduced `--dflash-target-model` argument for exporting DFlash draft models.
- Implemented `update_config_for_dflash` to handle DFlash-specific configurations.
- Enhanced model conversion and metadata handling for DFlash models.
- Added `DFlashDummyInputGenerator` for generating dummy inputs specific to DFlash.
- Updated tests to include DFlash model loading and export functionality.

This update enables the export and inference of models utilizing DFlash architecture, enhancing the OpenVINO integration.

- Removed the direct call to `_load_target_weights` in the constructor of `Qwen3DFlashForCausalLM`.
- Added a class method `from_pretrained` to handle loading weights and configurations more effectively.
- Updated weight handling to ensure compatibility with the target data type.
- Modified the `extract_dflash_debug_bundle.py` script to use `dtype` instead of `torch_dtype` and added `attn_implementation` parameter for draft model loading.

These changes improve the model's initialization process and enhance the flexibility of loading configurations.

…dels

- Introduced functions to check and annotate hidden states in models during export.
- Enhanced configuration to include hidden state outputs for models with multiple hidden layers.
- Implemented a test suite to validate hidden state annotations in exported OpenVINO models.

These changes improve the model export process by allowing the inclusion of hidden states, which is essential for certain text generation tasks.

- Implemented helper functions to find and add model outputs based on tensor names.
- Added a new test case to validate that annotated hidden state outputs match those from PyTorch for the GPT-2 model.
- Enhanced the export process to include hidden state outputs, ensuring compatibility with text generation tasks.

These changes improve the testing framework for OpenVINO model exports, specifically focusing on hidden state annotations.

- Added support for overriding the DFlash block size via the environment variable `DFLASH_BLOCK_SIZE_OVERRIDE`.
- Included error handling to ensure the block size is an integer greater than 1.

This enhancement allows for more flexible configuration of DFlash model exports.
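
The override and its validation can be sketched as follows. This is an illustration of the behavior described above, not the code from this PR: the function name `resolve_block_size`, its parameters, and the error messages are assumptions. Only the environment variable name and the "integer greater than 1" rule come from the commit message.

```python
import os

ENV_VAR = "DFLASH_BLOCK_SIZE_OVERRIDE"


def resolve_block_size(config_block_size, environ=None):
    """Return the block size to use for export.

    Hypothetical helper: falls back to the value from the model config
    when the environment variable is not set.
    """
    env = os.environ if environ is None else environ
    raw = env.get(ENV_VAR)
    if raw is None:
        return config_block_size

    # Reject anything that is not an integer (including an empty string).
    try:
        value = int(raw)
    except ValueError as exc:
        raise ValueError(
            f"{ENV_VAR} must be an integer greater than 1, got {raw!r}"
        ) from exc

    # Reject values that are not greater than 1, as the commit describes.
    if value <= 1:
        raise ValueError(
            f"{ENV_VAR} must be an integer greater than 1, got {value}"
        )
    return value
```

Passing `environ` explicitly keeps the helper easy to test without mutating the real process environment. Calling it with no second argument reads `os.environ`.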

- Added support for committed prefix cache policy in DFlash models by updating runtime information.
- Modified `DFlashDummyInputGenerator` to use "hidden_states" instead of "target_hidden" for input names.
- Updated Qwen3DFlash model to handle hidden states and past key values more effectively during inference.
- Introduced a new script to compare DFlash cache semantics between original and patched models.
- Enhanced tests to validate the integration of hidden states and ensure consistency in outputs.

These changes improve the functionality and testing of DFlash models within the OpenVINO framework, ensuring better performance and reliability.

ofirzaf added 12 commits May 4, 2026 02:51

Fix DFlash export to support dynamic block size (num_assistant_tokens)

e919101

Merge branch 'main' into dflash-qwen3.5

9ad5d21

Add support for qwen3.5 hidden_states annotations and relevant tests

644f3ab

Fix dflash export where target model is a text model nested in a VLM

e8b4dfb

Test cleanup

ff7c41f

Remove finished todo and left over testing not needed

115b343

ofirzaf wants to merge 12 commits into huggingface:main from ofirzaf:dflash-qwen3.5
